Your homework should be completed in R Markdown or Quarto and knitted to an html or pdf document. You will "turn in" this homework by uploading it to your GitHub Math_3190_Assignment repository in the Homework directory.

Some of the parts in this homework, like 1a, 2a, 2b, and 2c require writing down some math-heavy expressions. You may either type it up using LaTeX style formatting in R Markdown, or you can write it by hand (neatly) and include pictures or scans of your work in your R Markdown document.

Problem 1 (33 points)

Suppose we want to find the value of \(x\) and the objective function value at the global maximum and the global minimum of the function \(f(x)=\ln(1+x^2)(1+x)e^{-x^2}\). We will implement gradient descent and Newton’s method to find these extrema.

Part a (4 points)

Find \(f'(x)\) for this function. Do this by hand as a nice derivative refresher. Feel free to check your answer using something like Wolfram Alpha.

Response: Wherever the triple product \(fgh\) is defined and differentiable, the product rule gives \(\frac{d}{dx}[fgh] = f'gh+fg'h+fgh'\), so:

\[ f'(x)=\dfrac{2x(x+1)e^{-x^2}}{x^2+1}+e^{-x^2}\ln(x^2+1)(-2x^2-2x+1) \]
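As an optional sanity check (not required by the assignment), the analytic derivative can be compared against a central finite difference at a few points; the names `f` and `f_prime` here are just illustrative:

```r
# Optional check: compare the analytic derivative to a central
# finite difference at a few points.
f <- function(x) log(1 + x^2) * (1 + x) * exp(-x^2)
f_prime <- function(x) {
  (2 * x * (x + 1) * exp(-x^2)) / (x^2 + 1) +
    exp(-x^2) * log(x^2 + 1) * (-2 * x^2 - 2 * x + 1)
}
h <- 1e-6
x_test <- c(-1.5, -0.5, 0.5, 1.5)
numeric_deriv <- (f(x_test + h) - f(x_test - h)) / (2 * h)
max(abs(numeric_deriv - f_prime(x_test)))  # should be very small
```

If the maximum absolute difference is near zero, the hand-computed derivative is almost certainly correct.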

Part b (12 points)

Write a function that implements gradient descent. Use a backtracking line search each iteration to find \(\gamma_k\): the step size. This function should take eight inputs:

  • The starting value (or vector)
  • The function to optimize
  • The function’s derivative (or gradient)
  • A logical indicating if we want to find a max with a default of FALSE.
  • The maximum number of iterations with a default of 200
  • The initial step size with a default of 1
  • The line search beta parameter with a default of 0.5.
  • The stopping tolerance with a default of 1e-10.

The output of this function should be a list with the \(x\) value (or vector) that optimizes the function, the function value at that point, and the number of iterations it took to converge. This function should work in the case of one dimension or multiple dimensions since you’ll use it again in problem 2.

gradient_descent <- function(x_k, f, gradient, findmax = FALSE, max_iter = 200,
                             gamma_k = 1, beta = 0.5, tolerance = 1e-10) {
  
  # To find a max, minimize -f instead
  if (findmax) {
    obj_func <- function(x) -f(x)
    obj_gradient <- function(x) -gradient(x)
  } else {
    obj_func <- f
    obj_gradient <- gradient
  }
  
  for (k in 1:max_iter) {
    
    gamma <- gamma_k
    f_eval <- f(x_k)
    grad_k <- obj_gradient(x_k)  # cache the gradient for this iteration
    f_xk <- obj_func(x_k)
    
    # Backtracking line search: shrink gamma until the sufficient
    # decrease condition holds
    while (obj_func(x_k - gamma * grad_k) >
           f_xk - gamma / 4 * sum(grad_k^2)) {
      gamma <- beta * gamma
    }
    
    # Update x_k
    x_new <- x_k - gamma * grad_k
    
    # Check convergence (works for scalars and vectors)
    if (sqrt(sum((x_k - x_new)^2)) < tolerance) {
      break
    }
    
    x_k <- x_new
  }
  
  if (findmax) {
    print(paste("The maximum is:", f_eval))
    print("Max at x_k / betas :")
    print(x_k)
  } else {
    print(paste("The minimum is:", f_eval))
    print("Min at x_k / beta:")
    print(x_k)
  }
  
  print(paste("with", k, "iterations"))
  
  # Return the required list: optimizer, function value, iteration count
  invisible(list(x = x_k, value = f_eval, iterations = k))
}

Part c (3 points)

Use your gradient descent function to find the global min of \(f(x)=\ln(1+x^2)(1+x)e^{-x^2}\). Try using several starting points. Keep track of and report the number of iterations needed to converge at the different starting points.

fun <- function(x){
  log(1 + x^2) * (1 + x) * exp(-x^2)
}

fun_prime <- function(x){
  (2*x*(x+1)*exp(-x^2)) / (x^2 + 1) + exp(-x^2)*log(x^2 + 1)*(-2*x^2 - 2*x + 1)
}


gradient_descent(x_k = 1, f = fun, gradient = fun_prime, findmax = FALSE, gamma_k = 1, beta = 0.5)
## [1] "The minimum is: 0.000148535426099015"
## [1] "Min at x_k / beta:"
## [1] 3.347721
## [1] "with 200 iterations"
gradient_descent(x_k = 0.5, f = fun, gradient = fun_prime, findmax = FALSE, gamma_k = 1, beta = 0.5)
## [1] "The minimum is: 0"
## [1] "Min at x_k / beta:"
## [1] -4.373893e-09
## [1] "with 7 iterations"
gradient_descent(x_k = -0.5, f = fun, gradient = fun_prime, findmax = FALSE, gamma_k = 1, beta = 0.5)
## [1] "The minimum is: 0"
## [1] "Min at x_k / beta:"
## [1] -3.155042e-12
## [1] "with 8 iterations"
gradient_descent(x_k = -1, f = fun, gradient = fun_prime, findmax = FALSE, gamma_k = 1, beta = 0.5)
## [1] "The minimum is: -0.0623236248471729"
## [1] "Min at x_k / beta:"
## [1] -1.469439
## [1] "with 31 iterations"

Response: The global minimum of \(-0.0623\) (at \(x\approx-1.469\)) is found in 31 iterations starting from \(x_k=-1\). For \(x_k=-0.5\) and \(x_k=0.5\), the descent converges in seven to eight iterations to the critical point at \(x=0\), where \(f(0)=0\), and for \(x_k=1\) it drifts toward large \(x\) (where \(f\to 0\)) and fails to converge within the 200-iteration limit.

Part d (3 points)

Use your gradient descent function to find the global max of \(f(x)=\ln(1+x^2)(1+x)e^{-x^2}\). Again, try several starting points. Keep track of and report the number of iterations needed to converge at the different starting points.

x_k_minus1_max <- gradient_descent(x_k = -1, f = fun, gradient = fun_prime, findmax = TRUE, gamma_k = 0.9, beta = 0.5)
## [1] "The maximum is: 0.0879346149477647"
## [1] "Max at x_k / betas :"
## [1] -0.5409866
## [1] "with 13 iterations"
x_k_minus.5_max <- gradient_descent(x_k = -0.5, f = fun, gradient = fun_prime, findmax = TRUE, gamma_k = 0.9, beta = 0.5)
## [1] "The maximum is: 0.0879346149477647"
## [1] "Max at x_k / betas :"
## [1] -0.5409866
## [1] "with 11 iterations"
x_k_.5_max <- gradient_descent(x_k = 0.5, f = fun, gradient = fun_prime, findmax = TRUE, gamma_k = 0.9, beta = 0.5)
## [1] "The maximum is: 0.510181611312493"
## [1] "Max at x_k / betas :"
## [1] 0.9868668
## [1] "with 14 iterations"
x_k_1_max <- gradient_descent(x_k = 1, f = fun, gradient = fun_prime, findmax = TRUE, gamma_k = 0.9, beta = 0.5)
## [1] "The maximum is: 0.510181611312493"
## [1] "Max at x_k / betas :"
## [1] 0.9868668
## [1] "with 10 iterations"

Response: The global maximum of \(0.5102\) (at \(x\approx 0.987\)) is found in 10 iterations starting from \(x_k=1\) and in 14 iterations from \(x_k=0.5\). For the negative starting points \(x_k=-1\) and \(x_k=-0.5\), the ascent gets stuck at the local maximum of \(0.0879\) near \(x\approx-0.541\).

Part e (6 points)

Use the fact that \[ f''(x)=\dfrac{2e^{-x^2}}{(1+x^2)^2}\left((2x^3+2x^2-3x-1)(1+x^2)^2\ln(1+x^2)-4x^5-4x^4-3x^3-5x^2+3x+1\right) \] to implement Newton’s method to find the global max and the global min. Use the same starting points you did in parts c and d. Keep track of and report the number of iterations needed to converge at the different starting points. And yes, that second derivative is very messy! This is one reason why Newton’s method is less popular. You don’t have to write a function for this part to implement Newton’s method, but you can if you’d like.

fun_second <- function(x){
  (2*exp(-x^2)/(1+x^2)^2) * ((2*x^3 +2*x^2 -3*x -1) * (1+x^2)^2 
                             * log(1+x^2) -4*x^5 -4*x^4 -3*x^3 -5*x^2 +3*x +1)
}



newtons_method <- function(x_k, f, f_prime, f_second, findmax = FALSE, 
                           max_iter = 200, tolerance = 1e-10) {
  
  # Note: negating both f' and f'' leaves the Newton step unchanged, since
  # Newton's method simply seeks a critical point. The findmax argument is
  # kept so the result is labeled correctly.
  if (findmax) {
    f_prime_2 <- function(x) -f_prime(x)
    f_second_2 <- function(x) -f_second(x)
  } else {
    f_prime_2 <- f_prime
    f_second_2 <- f_second
  }
  
  # Newton iterations
  for (k in 1:max_iter) {
    
    f_eval <- f(x_k)
    
    x_new <- x_k - f_prime_2(x_k) / f_second_2(x_k)
    
    if (abs(x_k - x_new) < tolerance) {
      break
    }
    x_k <- x_new
  }
  results <- c(x_k, f_eval, k)
  
  if (findmax) {
    print("The maximum is:")
    print(f_eval)
  } else {
    print("The minimum is:")
    print(f_eval)
  }
  print(results)  # x value, function value, iteration count
  invisible(list(x = x_k, value = f_eval, iterations = k))
}


###

newtons_method(x_k=-1,f=fun, f_prime=fun_prime, f_second=fun_second, 
               findmax = F, max_iter = 200, tolerance = 1e-10) 
## [1] "The minimum is:"
## [1] -2.902035e-94
## [1] -1.485585e+01 -2.902035e-94  2.000000e+02
newtons_method(x_k=-1.03,f=fun, f_prime=fun_prime, f_second=fun_second, 
               findmax = F, max_iter = 200, tolerance = 1e-10)
## [1] "The minimum is:"
## [1] -0.06232362
## [1] -1.46943867 -0.06232362  7.00000000
newtons_method(x_k=1, f=fun, f_prime=fun_prime, f_second=fun_second, 
               findmax = T, max_iter = 200, tolerance = 1e-10) 
## [1] "The maximum is:"
## [1] 0.5101816
## [1] 0.9868668 0.5101816 4.0000000

Part f (3 points)

Compare the number of iterations needed for convergence for gradient descent and for Newton’s method. Response: Gradient descent with a backtracking line search converged to the global minimum in 31 iterations (from \(x_k=-1\)) and to the global maximum in 10 iterations (from \(x_k=1\)). Newton’s method failed to find the minimum from \(x_k=-1\), drifting to \(x\approx-14.9\) over 200 iterations, but from the nearby starting point \(x_k=-1.03\) it converged in only 7 iterations, and it found the maximum from \(x_k=1\) in just 4 iterations. When Newton’s method converges it does so in far fewer iterations, but it is much more sensitive to the starting point.
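The rate difference can also be illustrated on a self-contained toy problem (a simple quartic chosen only for illustration, not the assignment’s function): Newton’s method typically needs an order of magnitude fewer iterations than fixed-step gradient descent.

```r
# Toy comparison: minimize g(x) = x^4 - 3x^2 + x starting at x = 2.
g1 <- function(x) 4 * x^3 - 6 * x + 1   # g'(x)
g2 <- function(x) 12 * x^2 - 6          # g''(x)

# Newton's method
x <- 2; newton_iters <- 0
repeat {
  x_new <- x - g1(x) / g2(x)
  newton_iters <- newton_iters + 1
  if (abs(x_new - x) < 1e-10) break
  x <- x_new
}

# Gradient descent with a small fixed step
x <- 2; gd_iters <- 0
repeat {
  x_new <- x - 0.01 * g1(x)
  gd_iters <- gd_iters + 1
  if (abs(x_new - x) < 1e-10 || gd_iters >= 10000) break
  x <- x_new
}

c(newton = newton_iters, gd = gd_iters)  # Newton needs far fewer iterations
```

Of course, each Newton iteration requires a second derivative (or Hessian), which is exactly the cost noted in part e.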

Part g (2 points)

Use the optimize() function in R to find the global max and min.

opt <- optimize(fun,c(-100,100))
print(paste('Minimum at x=', opt$minimum))
## [1] "Minimum at x= -1.46943401434658"
print(paste('y=',opt$objective))
## [1] "y= -0.0623236248421882"
opt <- optimize(fun,c(-100,100), maximum=T)
print(paste('Maximum at x=', opt$maximum))
## [1] "Maximum at x= 0.986868461516571"
print(paste('y=',opt$objective))
## [1] "y= 0.510181611309466"

Problem 2 (27 points)

We implemented optimization to maximize the likelihood for a Poisson regression problem in the notes to estimate the \(\beta_i\) values for \(i=0,\dots,p-1\). Let’s do something similar to find the \(\boldsymbol{\beta}\) vector by maximizing the likelihood in logistic regression.

In logistic regression, we attempt to predict the probability of a success for a given case. We can use a Bernoulli random variable for this. For a Bernoulli random variable, \(P(Y=y_i) = p_i^{y_i}(1-p_i)^{1-y_i}\) for \(y_i=0,1\) where \(p_i\) is the probability of a success. In our case, for logistic regression, we have \(\text{logit}(p_i)=\mathbf{X}_i\boldsymbol{\beta}\). That means \(p_i = \dfrac{\exp(\mathbf{X}_i\boldsymbol{\beta})}{1+\exp(\mathbf{X}_i\boldsymbol{\beta})}\).
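As a quick numeric aside (not part of the assignment), the inverse-logit expression above is exactly base R’s `plogis()`, which can be handy for avoiding typos when coding the likelihood:

```r
# The inverse logit p = exp(Xb)/(1 + exp(Xb)) is base R's plogis().
xb <- seq(-5, 5, by = 0.5)
p_manual <- exp(xb) / (1 + exp(xb))
all.equal(p_manual, plogis(xb))  # TRUE
```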

Part a (4 points)

Find the likelihood function, \(L(\boldsymbol{\beta}|\mathbf{X},\boldsymbol{y})\), for the logistic regression for a sample of size \(n\).

Problem 2a response

Part b (6 points)

Show that the log likelihood function is \[ \ell(\boldsymbol{\beta}|\mathbf{X},\boldsymbol{y})=\sum_{i=1}^ny_i\ln\left(\dfrac{\exp(\mathbf{X}_i\boldsymbol{\beta})}{1+\exp(\mathbf{X}_i\boldsymbol{\beta})}\right) + \sum_{i=1}^n(1-y_i)\ln\left(\dfrac{1}{1+\exp(\mathbf{X}_i\boldsymbol{\beta})}\right) \] and then show it can be equivalently written

\[ \ell(\boldsymbol{\beta}|\mathbf{X},\boldsymbol{y}) = \sum_{i=1}^ny_i(\mathbf{X}_i\boldsymbol{\beta}) - \sum_{i=1}^n\ln\Big(1 + \exp(\mathbf{X}_i\boldsymbol{\beta})\Big). \]
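A numerical check on simulated data (a sanity check, not a proof; all names below are illustrative) confirms that the two forms of the log likelihood agree:

```r
# Check that the two forms of the logistic log likelihood agree.
set.seed(1)
n <- 50
X <- cbind(1, rnorm(n), rbinom(n, 1, 0.5))
beta <- c(-0.5, 0.3, 1)
eta <- drop(X %*% beta)                 # X_i beta for each i
y <- rbinom(n, 1, plogis(eta))
ll_form1 <- sum(y * log(exp(eta) / (1 + exp(eta)))) +
  sum((1 - y) * log(1 / (1 + exp(eta))))
ll_form2 <- sum(y * eta) - sum(log(1 + exp(eta)))
all.equal(ll_form1, ll_form2)  # TRUE
```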

Problem 2b response

Part c (5 points)

Find the gradient of the log likelihood if we have three \(\beta\)’s. That is, if \(\mathbf{X}_i\boldsymbol{\beta}=\beta_0+\beta_1x_{i1}+\beta_2x_{i2}\).

Problem 2c response
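As a numerical spot check of the written gradient (again on simulated data with illustrative names), each component \(\sum_i x_{ij}(y_i - p_i)\) can be compared against a finite difference of the log likelihood:

```r
# Compare the analytic gradient X^T (y - p) to a numerical gradient.
set.seed(2)
n <- 40
x1 <- rnorm(n); x2 <- rbinom(n, 1, 0.5)
X <- cbind(1, x1, x2)
y <- rbinom(n, 1, plogis(drop(X %*% c(-1, 0.5, 1))))
loglik <- function(b) sum(y * drop(X %*% b)) - sum(log(1 + exp(drop(X %*% b))))
grad <- function(b) drop(t(X) %*% (y - plogis(drop(X %*% b))))
b0 <- c(0.1, -0.2, 0.3)
h <- 1e-6
num_grad <- sapply(1:3, function(j) {
  e <- replace(numeric(3), j, h)
  (loglik(b0 + e) - loglik(b0 - e)) / (2 * h)
})
max(abs(num_grad - grad(b0)))  # should be very small
```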

Part d (3 points)

In Homework 3, Problem 2, part a, you read in the adult dataset (from the UC Irvine database). This is the one that we used to predict whether a person makes over $50K a year based on some other variables. The data came from the Census Bureau in 1994 and can be found in the Data folder in my Math3190_S24 GitHub repo. More info on the dataset can be found in the “adult.names” file.

library(httr)       # for GET() and content()
library(tidyverse)  # for read_csv(), rename(), mutate(), as_factor()

response <- GET("https://raw.githubusercontent.com/rbrown53/Math3190_Sp24/main/Data/adult.data")

adult_data <- readr::read_csv(content(response, as = "text"),col_names = FALSE) |> 
  rename(age = X1,
         work_class = X2,
         final_wght = X3,
         education = X4,
         edu_num = X5,
         marital = X6,
         occupation = X7,
         relationship = X8,
         race = X9,
         sex = X10,
         capital_gain = X11,
         capital_loss = X12,
         hrs_per_week = X13,
         country_origin = X14,
         salary = X15) |> 
  mutate(salary = as_factor(salary),
         work_class = as_factor(work_class),
         education = as_factor(education),
         marital = as_factor(marital),
         occupation = as_factor(occupation),
         relationship = as_factor(relationship),
         race = as_factor(race),
         sex = factor(sex, levels = c("Female", "Male")),
         country_origin = as_factor(country_origin))
## Rows: 48842 Columns: 15
## ── Column specification ───────────────────────────────
## Delimiter: ","
## chr (9): X2, X4, X6, X7, X8, X9, X10, X14, X15
## dbl (6): X1, X3, X5, X11, X12, X13
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(adult_data)

Read in the dataset and put column names like you did in HW 3. Then fit a logistic regression model using the glm() function that predicts salary from age/10 and sex. Note: we should divide the age by 10 in our model because this makes the gradient descent more stable (for whatever reason). We can divide by 10 in the glm() function by wrapping it in the I() function. Like this: glm(salary ~ I(age/10) + sex, data = adult, family = binomial).

adult_logit_mod <- glm(salary ~ I(age/10) + sex, 
                       data = adult_data, family = binomial)

summary(adult_logit_mod)
## 
## Call:
## glm(formula = salary ~ I(age/10) + sex, family = binomial, data = adult_data)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.629827   0.043440  -83.56   <2e-16 ***
## I(age/10)    0.382809   0.008144   47.01   <2e-16 ***
## sexMale      1.242016   0.028505   43.57   <2e-16 ***
## ---
## Signif. codes:  
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 53751  on 48841  degrees of freedom
## Residual deviance: 48966  on 48839  degrees of freedom
## AIC: 48972
## 
## Number of Fisher Scoring iterations: 4

Part e (2 points)

Define y, x1, and x2. y should be a vector of 0’s and 1’s with a 1 indicating that the person had a salary above $50,000. x1 should be a vector that contains all of the age values (divided by 10) and then x2 should be an indicator variable that indicates if the person is male or not. Taking columns from the matrix obtained using the model.matrix() may be useful here.

design_matrix <- model.matrix(adult_logit_mod)

y <- ifelse(adult_data$salary == ">50K", 1, 0)

x_1 <- design_matrix[,2]

x_2 <- design_matrix[,3]

Part f (5 points)

Use your gradient descent function from problem 1 to find estimates for \(\beta_0\), \(\beta_1\), and \(\beta_2\) for predicting salary from age and sex. Enter the following in your function:

  • Use the vector c(0, 0, 0) as your starting point.
  • Instead of starting the step size, \(\gamma_k\), at 1, start it at 0.1 to save some time with the line search.
  • Make sure the maximum number of iterations is at least 200.

You’ll know you did this right if the optimized values you end up with match (or are very close to) the estimates from your logistic regression model you fit in part d. This may take a minute or two to run.

start <- c(0,0,0)

# syntax for matrix multip. would be function(b, X, y){ sum(y * X %*% b)} etc.

ln_likely <- function(b){
  eta <- b[1] + b[2]*x_1 + b[3]*x_2
  sum(y * eta) - sum(log(1 + exp(eta)))
}

ln_likely_2 <- function(b){
  eta <- b[1] + b[2]*x_1 + b[3]*x_2
  sum(y * log(exp(eta) / (1 + exp(eta)))) + sum((1 - y) * log(1 / (1 + exp(eta))))
}
 
nabla_likely <- function(b){
  eta <- b[1] + b[2]*x_1 + b[3]*x_2
  p <- exp(eta) / (1 + exp(eta))
  
  d_beta_0 <- sum(y) - sum(p)
  d_beta_1 <- sum(y * x_1) - sum(x_1 * p)
  d_beta_2 <- sum(y * x_2) - sum(x_2 * p)
  
  c(d_beta_0, d_beta_1, d_beta_2)
}

gradient_descent(x_k = start, f = ln_likely, gradient = nabla_likely, findmax = TRUE, 
                 gamma_k = 0.1, beta = 0.5, max_iter = 300)
## [1] "The maximum is: -24483.07511215"
## [1] "Max at x_k / betas :"
## [1] -3.6298203  0.3828076  1.2420146
## [1] "with 174 iterations"

Response: The likelihood maximization converges to parameter estimates that match those of the glm(family = binomial) model to the nearest ten-thousandth after 174 iterations.

Part g (2 points)

Use the optim() function in R to find the estimates for the \(\beta\)’s here. This should be much faster since this function is much better optimized (no pun intended). The results here should also be very close to the estimates from the model we fit in part d, but may not match exactly.

max_likely_optimal <- optim(par=start, fn=ln_likely, gr=nabla_likely, control = list(fnscale = -1 ) )

max_likely_optimal$par
## [1] -3.6304477  0.3829196  1.2420241

Response: The optim() function recovers essentially the same parameters as the glm(family = binomial) model, agreeing to at least the nearest thousandth.
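One caveat worth noting: optim()’s default Nelder–Mead method ignores the gr argument entirely. With method = "BFGS" the supplied gradient is actually used, which typically gets even closer to the glm() estimates. A self-contained sketch on simulated data (the adult data is not reloaded here, so all names below are illustrative):

```r
# optim() with method = "BFGS" uses the supplied gradient, unlike the
# default Nelder-Mead, and typically matches glm() closely.
set.seed(3)
n <- 500
x1 <- rnorm(n); x2 <- rbinom(n, 1, 0.5)
X <- cbind(1, x1, x2)
y <- rbinom(n, 1, plogis(drop(X %*% c(-1, 0.5, 1))))
loglik <- function(b) sum(y * drop(X %*% b)) - sum(log(1 + exp(drop(X %*% b))))
grad <- function(b) drop(t(X) %*% (y - plogis(drop(X %*% b))))
fit_optim <- optim(c(0, 0, 0), loglik, grad, method = "BFGS",
                   control = list(fnscale = -1))  # fnscale = -1 maximizes
fit_glm <- glm(y ~ x1 + x2, family = binomial)
max(abs(fit_optim$par - unname(coef(fit_glm))))  # should be very small
```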